Predictive Modeling and Data Analysis of Breast Cancer

BENSON CYRIL NANA BOAKYE

1.0 Introduction

Breast cancer is a disease in which abnormal cells grow uncontrollably in breast tissue and form tumors that, if not detected and treated promptly, can metastasize throughout the body with fatal results. It arises in the lining cells (epithelium) of the ducts (85%) or lobules (15%) of the glandular tissue of the breast. The earliest form of the disease, carcinoma in situ, is confined to its site of origin and is not life-threatening. Once cancer cells invade the surrounding breast tissue, however, they can form lumps and thickening and may spread to nearby lymph nodes or other organs, leading to severe and life-threatening disease.

In 2022, there were 2.3 million new cases of breast cancer diagnosed worldwide, and approximately 670,000 deaths. The disease affects women of all ages after puberty, with increasing rates in later life. Global estimates reveal stark disparities in breast cancer incidence and mortality based on human development indices.

Female gender is the primary risk factor for breast cancer, with approximately 99% of cases occurring in women and 0.5–1% in men. Risk factors include increasing age, obesity, excessive alcohol consumption, family history of breast cancer, history of radiation exposure, reproductive history, tobacco use, and postmenopausal hormone therapy.

In the early stages, breast cancer may not present any noticeable symptoms, highlighting the importance of early detection. As the disease progresses, symptoms may include a breast lump or thickening, often without pain; changes in size, shape, or appearance of the breast; dimpling, redness, or other skin changes; alterations in the nipple or areola appearance; and abnormal or bloody discharge from the nipple.

1.1 Objective

This project addresses the question of how machine learning techniques can be applied to improve the breast cancer diagnostic process. Using cell-nucleus measurements extracted from digitized Fine Needle Aspiration (FNA) images of breast masses, it aims to predict whether a mass is malignant or benign. By leveraging machine learning algorithms, the project intends to create a tool that can assist healthcare professionals in making more accurate and timely diagnoses, ultimately contributing to better patient outcomes and advancing cancer research. To achieve this, the project evaluates five classification models: Logistic Regression, K-Nearest Neighbors (KNN), Random Forests, Support Vector Machines (SVM) with the RBF kernel, and Gradient Boosting.

1.2 Data Description

This dataset contains information from digitized images of Fine Needle Aspiration (FNA) samples of breast masses. The images are analyzed to extract features that describe the characteristics of the cell nuclei. The dataset is based on a study by K. P. Bennett and O. L. Mangasarian.

Attribute Information:

  • ID number: Unique identifier for each sample.
  • Diagnosis: Class label indicating whether the sample is malignant (M) or benign (B).

Features (columns 3–32): ten real-valued features are computed for each cell nucleus, and each is reported as three measurements (mean, standard error, and worst/largest value), giving 30 feature columns:

  • Radius: Average distance from the center to the edge.
  • Texture: Variation in gray-scale values.
  • Perimeter: Length around the edge.
  • Area: Size of the cell nucleus.
  • Smoothness: Local variation in radius lengths.
  • Compactness: Perimeter² / area − 1.0.
  • Concavity: Severity of the concave portions (indentations) of the contour.
  • Concave Points: Number of concave portions of the contour.
  • Symmetry: Degree of symmetry.
  • Fractal Dimension: Measure of the cell's complexity.

2.0 Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of examining a dataset to summarize its main characteristics and gain an overview of its structure. It helps discover patterns, spot anomalies, and test hypotheses using basic statistical summaries and visualization tools.

2.1 Understanding the data

In [1]:
#Loading my Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
%matplotlib inline
In [2]:
# Reading data into Data Frame
df = pd.read_csv('data.csv')
In [3]:
# Display the first 5 rows

df.head()
Out[3]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 NaN
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 NaN

5 rows × 33 columns

In [4]:
# Display the last 5 rows

df.tail()
Out[4]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
564 926424 M 21.56 22.39 142.00 1479.0 0.11100 0.11590 0.24390 0.13890 ... 26.40 166.10 2027.0 0.14100 0.21130 0.4107 0.2216 0.2060 0.07115 NaN
565 926682 M 20.13 28.25 131.20 1261.0 0.09780 0.10340 0.14400 0.09791 ... 38.25 155.00 1731.0 0.11660 0.19220 0.3215 0.1628 0.2572 0.06637 NaN
566 926954 M 16.60 28.08 108.30 858.1 0.08455 0.10230 0.09251 0.05302 ... 34.12 126.70 1124.0 0.11390 0.30940 0.3403 0.1418 0.2218 0.07820 NaN
567 927241 M 20.60 29.33 140.10 1265.0 0.11780 0.27700 0.35140 0.15200 ... 39.42 184.60 1821.0 0.16500 0.86810 0.9387 0.2650 0.4087 0.12400 NaN
568 92751 B 7.76 24.54 47.92 181.0 0.05263 0.04362 0.00000 0.00000 ... 30.37 59.16 268.6 0.08996 0.06444 0.0000 0.0000 0.2871 0.07039 NaN

5 rows × 33 columns

In [5]:
# Check the shape of the dataset

df.shape
Out[5]:
(569, 33)
In [6]:
# Display column names

df.columns
Out[6]:
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')
In [7]:
# Get data types of each column

df.dtypes
Out[7]:
id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
Unnamed: 32                float64
dtype: object
In [8]:
# Checking the distribution of categorical variables

categorical_columns = df.select_dtypes(include=['object']).columns
for column in categorical_columns:
    print(f'\nDistribution of {column}:')
    print(df[column].value_counts())
Distribution of diagnosis:
diagnosis
B    357
M    212
Name: count, dtype: int64
In [9]:
#Checking for Duplicate rows
df.duplicated().sum()
Out[9]:
0
In [10]:
# Check for missing values

df.isnull().sum()
Out[10]:
id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed: 32                569
dtype: int64

It can be seen that the column 'Unnamed: 32' consists entirely of NaN values, so it can be dropped.

In [11]:
# dropping 'Unnamed: 32' column.

df.drop("Unnamed: 32", axis=1, inplace=True)
In [12]:
# Dropping the 'id' column (not informative for this analysis)

df.drop('id', axis=1, inplace=True)
In [13]:
# descriptive statistics of data

df.describe()
Out[13]:
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 30 columns

2.2 Data Visualizations

In [14]:
# Plot histograms for all numerical columns
df.hist(bins=20, figsize=(20, 15))
plt.show()

Several features, such as radius_mean, perimeter_mean, and area_mean, exhibit significant right skewness, indicating the presence of outliers with large values. Conversely, features like texture_mean, smoothness_mean, symmetry_mean, and fractal_dimension_mean display distributions that are more symmetric and closer to normal.

The standard error features (_se) also show right-skewed distributions, suggesting lower values for most instances with a few higher outliers. The "worst" features (_worst) present wide distributions with varying degrees of skewness, reflecting the largest measurements.
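To back this visual reading with numbers, the skewness of each feature can be quantified directly. A minimal sketch using the built-in pandas skew() method (the variable name is illustrative):

# Quantify skewness for each numeric column: values near 0 suggest symmetry,
# large positive values indicate right skew.
skewness = df.select_dtypes(include='number').skew().sort_values(ascending=False)
print(skewness.head(10))   # most right-skewed features
print(skewness.tail(5))    # most symmetric features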

In [15]:
#Create Countplot of Diagnosis
ax = sns.countplot(x='diagnosis', data=df, palette=['#FF9999','#66B2FF'])

# Set title
plt.title('Countplot of Diagnosis')

# Create custom legend
from matplotlib.patches import Patch
legend_labels = ['Malignant', 'Benign']
legend_colors = ['#FF9999','#66B2FF']
handles = [Patch(color=color, label=label) for color, label in zip(legend_colors, legend_labels)]
plt.legend(handles=handles, title='Diagnosis')

# Show the plot
plt.show()

The countplot displays the distribution of breast cancer diagnoses in the dataset, highlighting that there are more benign cases (357, shown in blue) than malignant cases (212, shown in red). The plot provides a clear visual comparison between the two categories, indicating that benign diagnoses are more prevalent in this dataset.

In [16]:
# Calculate the correlation matrix

df_numeric = df.drop(columns=['diagnosis'])

correlation_matrix = df_numeric.corr()

# Plot heatmap for the correlation matrix
plt.figure(figsize=(20, 18))
sns.heatmap(correlation_matrix, annot=True,  linewidths=.5, cmap='coolwarm', center=0)
plt.show()

The heatmap shows the correlation matrix for the features in the breast cancer dataset. High positive correlations (closer to 1) are highlighted in dark red, while negative correlations (closer to -1) are in blue. It reveals strong correlations among certain features, such as 'radius_mean', 'perimeter_mean', and 'area_mean', indicating that they tend to increase together. This visualization helps in understanding the relationships between different features, which is crucial for feature selection and model building.
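Rather than reading the strongest relationships off the heatmap by eye, the top pairs can be listed explicitly. A short sketch (masking the lower triangle and diagonal so each pair appears once; the variable names are illustrative):

# Extract the ten most strongly correlated feature pairs.
corr_upper = correlation_matrix.where(
    np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1)
)
top_pairs = corr_upper.stack().sort_values(ascending=False).head(10)
print(top_pairs)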

In [17]:
# Distribution plots (distplots) show the distribution of numerical features.
num_features = len(df.columns) - 1  # every column except 'diagnosis'
num_cols = 3
num_rows = (num_features + num_cols - 1) // num_cols

fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, num_rows * 5))
fig.tight_layout(pad=3.0)

axes = axes.flatten()

for i, column in enumerate(df.columns[1:]):  # skip only the 'diagnosis' column
    sns.kdeplot(df[column], ax=axes[i], color='orange', fill=True, linewidth=2)  # filled orange KDE plot
    axes[i].set_title(f'Distribution of {column}')

# Remove any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.show()

Most features, such as radius_mean, perimeter_mean, and area_mean, exhibit right-skewed distributions, indicating a concentration of lower values with fewer higher values. Features like texture_mean and smoothness_mean show more symmetric distributions, suggesting a more uniform spread. The standard error features are tightly clustered, indicating low variability within cell measurements. The 'worst' case features, such as radius_worst and texture_worst, highlight the extreme values in the dataset.

2.3 Multivariate Analysis of Features In Same Category

2.3.1 Mean Measurements of Cancer Cells

In [18]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

mean_of_measurements= ['diagnosis','radius_mean' , 'perimeter_mean' , 'area_mean' , 'concavity_mean' , 'concave points_mean']

custom_palette = sns.color_palette("colorblind", 2)
sns.pairplot(df[mean_of_measurements], hue='diagnosis', palette=custom_palette)

2.3.2 Mean Characteristics of Cancer Cells

In [19]:
mean_of_characteristics = ['diagnosis','texture_mean', 'smoothness_mean', 'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean']


custom_palette = sns.color_palette("colorblind", 2)

sns.pairplot(df[mean_of_characteristics], hue='diagnosis', palette=custom_palette)

2.3.3 Standard Error of Measurements

In [20]:
Standard_Error_of_Measurements= ['diagnosis','radius_se', 'perimeter_se', 'area_se', 'concavity_se', 'concave points_se']
custom_palette = sns.color_palette("colorblind", 2)
sns.pairplot(df[Standard_Error_of_Measurements], hue='diagnosis', palette=custom_palette)

2.3.4 Standard Error of Characteristics

In [21]:
Standard_Error_of_Characteristics = ['diagnosis','texture_se', 'smoothness_se', 'compactness_se', 'symmetry_se', 'fractal_dimension_se']
custom_palette = sns.color_palette("colorblind", 2)
sns.pairplot(df[Standard_Error_of_Characteristics], hue='diagnosis', palette=custom_palette)

2.3.5 Worst of Measurements

In [22]:
Worst_of_Measurements= ['diagnosis','radius_worst', 'perimeter_worst', 'area_worst', 'concavity_worst', 'concave points_worst']
custom_palette = sns.color_palette("colorblind", 2)
sns.pairplot(df[Worst_of_Measurements], hue='diagnosis', palette=custom_palette)

2.3.6 Worst of Characteristics

In [23]:
Worst_of_Characteristics= ['diagnosis','texture_worst', 'smoothness_worst', 'compactness_worst', 'symmetry_worst', 'fractal_dimension_worst']
custom_palette = sns.color_palette("colorblind", 2)
sns.pairplot(df[Worst_of_Characteristics], hue='diagnosis', palette=custom_palette)

3.0 Data Processing and Models Building

3.1 Processing of Data

In [24]:
# Convert 'diagnosis' to numeric
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

# Count the occurrences of each diagnosis type
diagnosis_counts = df['diagnosis'].value_counts()
print("\nCounts of each diagnosis type:")
print(diagnosis_counts)
Counts of each diagnosis type:
diagnosis
0    357
1    212
Name: count, dtype: int64
In [25]:
# Identify features that are highly correlated with another feature (candidates for removal)
correlation_threshold = 0.75
correlation_matrix = df.corr()
upper_tri = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
high_correlation_features = [column for column in upper_tri.columns if any(upper_tri[column] > correlation_threshold)]
print("Highly correlated features:")
print(high_correlation_features)
Highly correlated features:
['perimeter_mean', 'area_mean', 'concavity_mean', 'concave points_mean', 'perimeter_se', 'area_se', 'concavity_se', 'concave points_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'fractal_dimension_worst']
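Note that this cell only identifies the highly correlated columns; nothing is dropped, and the models below are trained on all 30 features (the split shapes in the next section confirm this). If a reduced feature set were desired, a minimal sketch would be:

# Hypothetical reduction step (not applied in this analysis):
df_reduced = df.drop(columns=high_correlation_features)
print(df_reduced.shape)  # expected (569, 13): 'diagnosis' plus 12 remaining features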

3.2 Partitioning Data into Training and Testing Sets

In [26]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('diagnosis', axis=1),
    df['diagnosis'],
    test_size=0.2,
    random_state=42
)

# Print the shapes of the resulting datasets
print("Shape of training set (features):", X_train.shape)
print("Shape of test set (features):", X_test.shape)
print("Shape of training set (target):", y_train.shape)
print("Shape of test set (target):", y_test.shape)
Shape of training set (features): (455, 30)
Shape of test set (features): (114, 30)
Shape of training set (target): (455,)
Shape of test set (target): (114,)
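The split above is not stratified; with 357 benign and 212 malignant cases, stratification keeps the class ratio identical in the training and test sets. A sketch of the stratified alternative (not used for the results below):

# Stratified variant of the same split, preserving the benign/malignant ratio.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    df.drop('diagnosis', axis=1),
    df['diagnosis'],
    test_size=0.2,
    random_state=42,
    stratify=df['diagnosis']
)
print(ys_train.value_counts(normalize=True))
print(ys_test.value_counts(normalize=True))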
In [27]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the same scaler
X_test_scaled = scaler.transform(X_test)

print("Scaled Training Data:")
print(X_train_scaled)

print("\nScaled Test Data:")
print(X_test_scaled)
Scaled Training Data:
[[-1.44075296 -0.43531947 -1.36208497 ...  0.9320124   2.09724217
   1.88645014]
 [ 1.97409619  1.73302577  2.09167167 ...  2.6989469   1.89116053
   2.49783848]
 [-1.39998202 -1.24962228 -1.34520926 ... -0.97023893  0.59760192
   0.0578942 ]
 ...
 [ 0.04880192 -0.55500086 -0.06512547 ... -1.23903365 -0.70863864
  -1.27145475]
 [-0.03896885  0.10207345 -0.03137406 ...  1.05001236  0.43432185
   1.21336207]
 [-0.54860557  0.31327591 -0.60350155 ... -0.61102866 -0.3345212
  -0.84628745]]

Scaled Test Data:
[[-0.46649743 -0.13728933 -0.44421138 ... -0.19435087  0.17275669
   0.20372995]
 [ 1.36536344  0.49866473  1.30551088 ...  0.99177862 -0.561211
  -1.00838949]
 [ 0.38006578  0.06921974  0.40410139 ...  0.57035018 -0.10783139
  -0.20629287]
 ...
 [-0.73547237 -0.99852603 -0.74138839 ... -0.27741059 -0.3820785
  -0.32408328]
 [ 0.02898271  2.0334026   0.0274851  ... -0.49027026 -1.60905688
  -0.33137507]
 [ 1.87216885  2.80077153  1.80354992 ...  0.7925579  -0.05868885
  -0.09467243]]
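The scaler is fit on the training data only, which correctly avoids leaking test-set statistics. An equivalent and less error-prone pattern bundles scaling and a classifier into a single Pipeline; a sketch using Logistic Regression as the example estimator (not part of the original workflow):

# A Pipeline ties scaling to the classifier so the steps cannot be applied
# out of order or fit on the wrong data.
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=10000, random_state=42)),
])
pipe.fit(X_train, y_train)          # the scaler is fit on X_train only
print(pipe.score(X_test, y_test))   # the same scaling is reused on X_test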

4.0 Classification Models

4.1 K-Nearest Neighbors (KNN)

In [28]:
from sklearn.neighbors import KNeighborsClassifier

# List to store error rates
error_rate = []

# Iterate over possible values for n_neighbors
for i in range(1, 42):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train_scaled, y_train)  # Use scaled data
    pred_i = knn.predict(X_test_scaled)  # Use scaled data
    error_rate.append(np.mean(pred_i != y_test))

# Print or analyze the error rates to find the optimal number of neighbors
print("Error Rates for different values of k:")
print(error_rate)
Error Rates for different values of k:
[0.06140350877192982, 0.05263157894736842, 0.05263157894736842, 0.043859649122807015, 0.05263157894736842, 0.043859649122807015, 0.05263157894736842, 0.043859649122807015, 0.03508771929824561, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.05263157894736842, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015, 0.043859649122807015]

Visualize the Error Rates to find the optimal number of neighbors

In [29]:
plt.figure(figsize=(12, 6))
plt.plot(range(1, 42), error_rate, color='red', linestyle='--',
         marker='o', markersize=8, markerfacecolor='b')
plt.title('Error Rate vs. Number of Neighbors (k)')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Error Rate')
plt.grid(True)
plt.show()
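One caveat: k is selected here by its error on the test set, which risks tuning the choice of k to this particular split. A sketch that instead selects k by 5-fold cross-validation on the training data (GridSearchCV refits the best model automatically):

# Cross-validated selection of k using the training data only.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': list(range(1, 42))},
    cv=5,
    scoring='accuracy'
)
grid.fit(X_train_scaled, y_train)
print(grid.best_params_, grid.best_score_)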
In [30]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize the K-Nearest Neighbors classifier with the optimal k value
optimal_k = 9
knn = KNeighborsClassifier(n_neighbors=optimal_k)

# Train the model with the scaled training data
knn.fit(X_train_scaled, y_train)

# Make predictions with the scaled test data
y_pred = knn.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print evaluation metrics
print("Accuracy Score:", accuracy)
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
Accuracy Score: 0.9649122807017544

Confusion Matrix:
[[69  2]
 [ 2 41]]

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        71
           1       0.95      0.95      0.95        43

    accuracy                           0.96       114
   macro avg       0.96      0.96      0.96       114
weighted avg       0.96      0.96      0.96       114

In [31]:
# Plotting the confusion matrix heatmap
plt.figure(figsize=(8, 6))
conf_matrix_knn = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix_knn, annot=True, fmt='d', cmap='coolwarm',
            xticklabels=['Class 0', 'Class 1'],
            yticklabels=['Class 0', 'Class 1'])
plt.title('Confusion Matrix Heatmap (K-Nearest Neighbors)')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
In [32]:
from sklearn.metrics import roc_curve, auc

# Compute ROC curve and AUC for KNN
fpr_knn, tpr_knn, _ = roc_curve(y_test, knn.predict_proba(X_test_scaled)[:, 1])
roc_auc_knn = auc(fpr_knn, tpr_knn)


# Plot ROC curve for KNN
plt.figure(figsize=(10, 6))
plt.plot(fpr_knn, tpr_knn, color='#1f77b4', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_knn)  # Blue color
plt.plot([0, 1], [0, 1], color='#ff7f0e', lw=2, linestyle='--')  # Orange color
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) - KNN')
plt.legend(loc='lower right')
plt.show()

4.2 Logistic Regression

In [33]:
# Initialize the Logistic Regression model
log_reg = LogisticRegression(max_iter=10000, random_state=42)

# Train the model
log_reg.fit(X_train_scaled, y_train)

# Make predictions
y_pred_log_reg = log_reg.predict(X_test_scaled)

# Evaluate the Logistic Regression model
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
conf_matrix_log_reg = confusion_matrix(y_test, y_pred_log_reg)
class_report_log_reg = classification_report(y_test, y_pred_log_reg)

# Print evaluation metrics
print("Accuracy Score (Logistic Regression):", accuracy_log_reg)
print("\nConfusion Matrix (Logistic Regression):")
print(conf_matrix_log_reg)
print("\nClassification Report (Logistic Regression):")
print(class_report_log_reg)
Accuracy Score (Logistic Regression): 0.9736842105263158

Confusion Matrix (Logistic Regression):
[[70  1]
 [ 2 41]]

Classification Report (Logistic Regression):
              precision    recall  f1-score   support

           0       0.97      0.99      0.98        71
           1       0.98      0.95      0.96        43

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

In [34]:
# Plotting the confusion matrix heatmap
plt.figure(figsize=(8, 6))
conf_matrix_log_reg = confusion_matrix(y_test, y_pred_log_reg)
sns.heatmap(conf_matrix_log_reg, annot=True, fmt='d', cmap='coolwarm',
            xticklabels=['Class 0', 'Class 1'],
            yticklabels=['Class 0', 'Class 1'])
plt.title('Confusion Matrix Heatmap (Logistic Regression)')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
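Because the features were standardized, the fitted logistic regression coefficients are on a comparable scale and can be ranked by magnitude to see which measurements drive the predictions. A minimal sketch (feature names come from the unscaled training frame, whose column order matches the scaled array):

# Rank coefficients by absolute size; positive values push toward 'malignant'.
coef = pd.Series(log_reg.coef_[0], index=X_train.columns)
ranked = coef.reindex(coef.abs().sort_values(ascending=False).index)
print(ranked.head(10))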

4.3 Random Forest

In [35]:
# Initialize the Random Forest model
random_forest = RandomForestClassifier(n_estimators=400, random_state=42)

# Train the model
random_forest.fit(X_train_scaled, y_train)

# Make predictions
y_pred_rf = random_forest.predict(X_test_scaled)

# Evaluate the Random Forest model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
class_report_rf = classification_report(y_test, y_pred_rf)

# Print evaluation metrics
print("Accuracy Score (Random Forest):", accuracy_rf)
print("\nConfusion Matrix (Random Forest):")
print(conf_matrix_rf)
print("\nClassification Report (Random Forest):")
print(class_report_rf)
Accuracy Score (Random Forest): 0.9649122807017544

Confusion Matrix (Random Forest):
[[70  1]
 [ 3 40]]

Classification Report (Random Forest):
              precision    recall  f1-score   support

           0       0.96      0.99      0.97        71
           1       0.98      0.93      0.95        43

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

In [36]:
from sklearn.metrics import roc_curve, auc

# Compute ROC curve and AUC
fpr_rf, tpr_rf, _ = roc_curve(y_test, random_forest.predict_proba(X_test_scaled)[:, 1])
roc_auc_rf = auc(fpr_rf, tpr_rf)

# Plot ROC curve
plt.figure(figsize=(10, 6))
plt.plot(fpr_rf, tpr_rf, color='#1f77b4', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_rf)  # Blue color
plt.plot([0, 1], [0, 1], color='#ff7f0e', lw=2, linestyle='--')  # Orange color
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) - Random Forest')
plt.legend(loc='lower right')
plt.show()
In [37]:
from sklearn.metrics import precision_recall_curve

# Compute Precision-Recall curve and AUC
precision_rf, recall_rf, _ = precision_recall_curve(y_test, random_forest.predict_proba(X_test_scaled)[:, 1])
pr_auc_rf = auc(recall_rf, precision_rf)

# Plot Precision-Recall curve
plt.figure(figsize=(10, 6))
plt.plot(recall_rf, precision_rf, color='orange', lw=2, label='Precision-Recall curve (area = %0.2f)' % pr_auc_rf)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve - Random Forest')
plt.legend(loc='best')
plt.show()
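Random forests also expose impurity-based feature importances, which offer a quick check on which measurements the model relies on. A short sketch for the fitted model above:

# Ten most influential features according to the trained forest.
importances = pd.Series(random_forest.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))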

4.4 Support Vector Machines (SVM)

In [38]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize the SVM model with an RBF kernel
svm_model_rbf = SVC(kernel='rbf', random_state=42)

# Train the model
svm_model_rbf.fit(X_train_scaled, y_train)

# Make predictions
y_pred_svm_rbf = svm_model_rbf.predict(X_test_scaled)

# Evaluate the SVM model
accuracy_svm_rbf = accuracy_score(y_test, y_pred_svm_rbf)
conf_matrix_svm_rbf = confusion_matrix(y_test, y_pred_svm_rbf)
class_report_svm_rbf = classification_report(y_test, y_pred_svm_rbf)

# Print evaluation metrics
print("Accuracy Score (SVM with RBF kernel):", accuracy_svm_rbf)
print("\nConfusion Matrix (SVM with RBF kernel):")
print(conf_matrix_svm_rbf)
print("\nClassification Report (SVM with RBF kernel):")
print(class_report_svm_rbf)
Accuracy Score (SVM with RBF kernel): 0.9824561403508771

Confusion Matrix (SVM with RBF kernel):
[[71  0]
 [ 2 41]]

Classification Report (SVM with RBF kernel):
              precision    recall  f1-score   support

           0       0.97      1.00      0.99        71
           1       1.00      0.95      0.98        43

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

In [39]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the confusion matrix heatmap
plt.figure(figsize=(8, 6))
conf_matrix_svm_rbf = confusion_matrix(y_test, y_pred_svm_rbf)
sns.heatmap(conf_matrix_svm_rbf, annot=True, fmt='d', cmap='coolwarm',
            xticklabels=['Class 0', 'Class 1'],
            yticklabels=['Class 0', 'Class 1'])
plt.title('Confusion Matrix Heatmap (SVM with RBF Kernel)')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
In [40]:
from sklearn.metrics import roc_curve, auc

# Compute ROC curve and AUC for SVM with RBF kernel
fpr_svm_rbf, tpr_svm_rbf, _ = roc_curve(y_test, svm_model_rbf.decision_function(X_test_scaled))
roc_auc_svm_rbf = auc(fpr_svm_rbf, tpr_svm_rbf)

# Plot ROC curve for SVM with RBF kernel
plt.figure(figsize=(10, 6))
plt.plot(fpr_svm_rbf, tpr_svm_rbf, color='#1f77b4', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_svm_rbf)  # Blue color
plt.plot([0, 1], [0, 1], color='#ff7f0e', lw=2, linestyle='--')  # Orange color
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) - SVM with RBF Kernel')
plt.legend(loc='lower right')
plt.show()
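The SVM above uses scikit-learn's defaults (C=1.0, gamma='scale'). Tuning both on the training data might yield further gains; a sketch with an illustrative grid (the candidate values are assumptions, not tuned results):

# Cross-validated search over C and gamma for the RBF kernel.
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.01, 0.1, 1]}
grid_svm = GridSearchCV(SVC(kernel='rbf', random_state=42), param_grid, cv=5)
grid_svm.fit(X_train_scaled, y_train)
print(grid_svm.best_params_, grid_svm.best_score_)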

4.5 Gradient Boosting

In [41]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize the Gradient Boosting model
gb_model = GradientBoostingClassifier(random_state=42)

# Train the model
gb_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_gb = gb_model.predict(X_test_scaled)

# Evaluate the Gradient Boosting model
accuracy_gb = accuracy_score(y_test, y_pred_gb)
conf_matrix_gb = confusion_matrix(y_test, y_pred_gb)
class_report_gb = classification_report(y_test, y_pred_gb)

# Print evaluation metrics
print("Accuracy Score (Gradient Boosting):", accuracy_gb)
print("\nConfusion Matrix (Gradient Boosting):")
print(conf_matrix_gb)
print("\nClassification Report (Gradient Boosting):")
print(class_report_gb)
Accuracy Score (Gradient Boosting): 0.956140350877193

Confusion Matrix (Gradient Boosting):
[[69  2]
 [ 3 40]]

Classification Report (Gradient Boosting):
              precision    recall  f1-score   support

           0       0.96      0.97      0.97        71
           1       0.95      0.93      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

In [42]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the confusion matrix heatmap
plt.figure(figsize=(8, 6))
conf_matrix_gb = confusion_matrix(y_test, y_pred_gb)
sns.heatmap(conf_matrix_gb, annot=True, fmt='d', cmap='coolwarm',
            xticklabels=['Class 0', 'Class 1'],
            yticklabels=['Class 0', 'Class 1'])
plt.title('Confusion Matrix Heatmap (Gradient Boosting)')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
In [43]:
from sklearn.metrics import roc_curve, auc

# Compute ROC curve and AUC for Gradient Boosting
fpr_gb, tpr_gb, _ = roc_curve(y_test, gb_model.predict_proba(X_test_scaled)[:, 1])
roc_auc_gb = auc(fpr_gb, tpr_gb)

# Plot ROC curve for Gradient Boosting
plt.figure(figsize=(10, 6))
plt.plot(fpr_gb, tpr_gb, color='#1f77b4', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_gb)  # Blue color
plt.plot([0, 1], [0, 1], color='#ff7f0e', lw=2, linestyle='--')  # Orange color
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) - Gradient Boosting')
plt.legend(loc='lower right')
plt.show()
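Gradient boosting builds its trees sequentially, so test accuracy can be tracked as trees are added. A sketch using staged_predict to see where performance plateaus for the fitted model above:

# Accuracy on the test set after each boosting stage.
staged_acc = [
    accuracy_score(y_test, stage_pred)
    for stage_pred in gb_model.staged_predict(X_test_scaled)
]
best_n = int(np.argmax(staged_acc)) + 1
print(f"Best accuracy {max(staged_acc):.4f} at {best_n} trees "
      f"(default n_estimators=100)")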

5.0 Final Results

In [44]:
print("Accuracy Score:", accuracy)
Accuracy Score: 0.9649122807017544
In [45]:
print("Accuracy Score (Logistic Regression):", accuracy_log_reg)
Accuracy Score (Logistic Regression): 0.9736842105263158
In [46]:
print("Accuracy Score (Random Forest):", accuracy_rf)
Accuracy Score (Random Forest): 0.9649122807017544
In [47]:
print("Accuracy Score (SVM with RBF kernel):", accuracy_svm_rbf)
Accuracy Score (SVM with RBF kernel): 0.9824561403508771
In [48]:
print("Accuracy Score (Gradient Boosting):", accuracy_gb)
Accuracy Score (Gradient Boosting): 0.956140350877193
In [49]:
# Accuracy scores
models = [
    "KNN",
    "Logistic Regression",
    "Random Forest",
    "SVM with RBF Kernel",
    "Gradient Boosting"
]

accuracies = [
    accuracy,
    accuracy_log_reg,
    accuracy_rf,
    accuracy_svm_rbf,
    accuracy_gb
]

# Sort the models and accuracies by accuracy score in ascending order
sorted_indices = np.argsort(accuracies)
sorted_models = np.array(models)[sorted_indices]
sorted_accuracies = np.array(accuracies)[sorted_indices]

# Create the horizontal bar plot
plt.figure(figsize=(10, 6))
bars = plt.barh(sorted_models, sorted_accuracies, color=['#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2'])

# Add percentage labels
for bar in bars:
    plt.text(
        bar.get_width() + 0.02,  # X position of the text
        bar.get_y() + bar.get_height() / 2,  # Y position of the text
        f"{bar.get_width()*100:.2f}%",  # Label as percentage
        va='center',  # Vertical alignment of the text
        ha='left'  # Horizontal alignment of the text
    )

plt.xlabel('Accuracy Score')
plt.ylabel('Models')
plt.title('Accuracy Scores of Different Models with Percentages (Sorted)')
plt.xlim([0, 1])  # Accuracy scores range from 0 to 1
plt.show()
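These scores come from a single 80/20 split, so small differences between models may be noise. A sketch that hedges the ranking with 5-fold cross-validation on the full dataset, scaling inside each fold to avoid leakage (not part of the original results):

# Cross-validated comparison of all five models.
from sklearn.model_selection import cross_val_score

X, y = df.drop('diagnosis', axis=1), df['diagnosis']
candidates = {
    'KNN': KNeighborsClassifier(n_neighbors=9),
    'Logistic Regression': LogisticRegression(max_iter=10000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=400, random_state=42),
    'SVM with RBF Kernel': SVC(kernel='rbf', random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=5)
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")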

6.0 Conclusion

This project aimed to enhance breast cancer diagnostics by applying various machine learning techniques to predict the malignancy of breast cancer cases based on Fine Needle Aspiration (FNA) image data. By evaluating five different classification models—Logistic Regression, K-Nearest Neighbors (KNN), Random Forests, Support Vector Machines (SVM) with the RBF kernel, and Gradient Boosting—the project sought to identify the most effective approach for improving diagnostic accuracy and aiding healthcare professionals in making more precise and timely diagnoses.

The results of the project indicate the following accuracy scores for each model:

  • Logistic Regression: 0.9737
  • K-Nearest Neighbors (KNN): 0.9649
  • Random Forests: 0.9649
  • Support Vector Machines (SVM) with RBF kernel: 0.9825
  • Gradient Boosting: 0.9561

Among the models tested, the SVM with RBF kernel achieved the highest accuracy of 0.9825 (98.25%), demonstrating the strongest performance in distinguishing malignant from benign cases on this test set. Logistic Regression followed at 0.9737, with KNN and Random Forests tied at 0.9649. Gradient Boosting, though slightly less accurate at 0.9561, still performed competitively.

These findings underscore the potential of machine learning algorithms in advancing breast cancer diagnostics. The SVM with RBF kernel, in particular, shows promise as a tool for enhancing diagnostic accuracy and aiding in early detection. By integrating such machine learning models into clinical practice, there is potential for improved patient outcomes and a significant contribution to the ongoing research in cancer diagnosis.